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M any scientific applications have intense computational 
and I/O requirements. Although multiprocessors have 
permitted astounding increases in computational per- 
formance, the formidable I/O needs of these applica- 
tions cannot be met by current multiprocessors and 
their I/O subsystems. To prevent I/O subsystems from forever bottle- 
necking multiprocessors and limiting the range of feasible applications, 
new I/O subsystems must be designed. 

The successful design of computer systems (both hardware and software) 
depends on a thorough understanding of their intended use. A system's 
designer optimizes the p>olicies and mechanisms for the cases expected to be 
most common in the user's workload. In the case of multiprocessor file 
systems, however, designers have been forced to build file systems based 
only on speculation about how they would be used, extrapolating from 
file-system characterizations of general-purpose workloads on uniproces- 
sor and distributed systems or scientific workloads on vector supercom- 
puters (see sidebar on related work). To help these system designers, in 
June 1993 we began the Charisma project, so named because the project 
sought to characterize I/O in scientific multiprocessor applications from a vari- 
ety of production parallel computing platforms and sites. 

The Charisma project is unique in recording individual read and write 
requests-in live, multiprogramming, parallel workloads (rather than from 
selected or nonparallel applications). In this article, we present the first 
results from the project: a characterization of the file-system workload on 
an iPSC/860 multiprocessor running production, parallel scientific appli- 


^ The Charisma 
project records 
individual read and 
write requests in live, 
multiprogramming, 
parallel workloads. 
This information 
can be used to 
design more efficient 
multiprocessor 
systems. 
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The iPSC/860 and CFS 

The iPSC/860 is a distributed-rnemoiy, 
message-passing, MIMD machine. 
The compute nodes are based on the 
Intel i860 processor and are connected 
by a hypercube network. I/O is handled 
by dedicated I/O nodes, which are each 
connected to a single compute node 
rather than directly to the hypercube 
interconnect. The I/O nodes are based 
on the Intel i386 processor and each is 
connected to a single SCSI disk drive. 
There may also be one or more service 
nodes that handle such things as Ether- 
net connections or interactive shells. 

The iPSC/860 at NASA Ames has 
128 compute nodes, each with 8 
Mbytes of memory, and 1 0 I/O nodes, 
each with 4 Mbytes of memory and a 
single 760 Mbyte disk drive. There is 
also a single service node that handles 


a 10-Mbit-per-second Ethernet con- 
nection to the host computer. The 
total I/O capacity is 7.6 Gbytes and the 
total bandwidth is less than 10 Mbytes 
per second. 

Intel’s CFS stripes each file across all 
disks in 4-Kbyte blocks. Compute 
nodes send requests directly to the 
appropriate I/O node for service. Only 
the I/O nodes have a buffer c'ache. CFS 
provides four I/O modes to help the pro- 
grammer coordinate parallel access. 
Mode 0 gives each process its own file 
pointer; mode 1 shares a single file 
pointer among all processes; mode 2 is 
like mode 1, but enforces a round- 
robin ordering of accesses across ail 
nodes; and mode 3 is like mode 2, but 
restricts the access sizes to be identical. 
More details about CFS and its perfor- 


mance can be found in work by Pierce,^ 
Nitzberg,‘ and French et al.^ 
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cations at NASA’s Ames Research Center. We use the 
resulting information to address the following questions: 

• What did the job mix look like — that is, how many 
jobs ran concurrently? How many processors did 
each use? How many files? 

• How many files were read and written? Which were 
temporary files? What were their sizes? 

• What were typical read and write request sizes, and 
how were they spaced in the file? Were the accesses 
sequential? In what way? 

• What forms of locality were there? How might 
caching be useful? 

• What are the implications for file-system design? 

Methods 

To be useful to a system designer, a workload charac- 
terization must be based on a realistic workload similar 
to what is expected of it in the future. This meant that we 
had to trace a multiprocessor file system that was in use 
for prodtiaion scientific computing. The Intel iPSC/860 
at NASA Ames’ Numerical Aerodynamics Simulation 
(NAS) facility met this criterion (see sidebar). (The facil- 
ity's three newer multiprocessors, an Intel Paragon, a 
Thinking Machines QM-5, and an IBM SP-2, do not yet 
have a mature production workload.) 

Ideally, a workload characterization is an architec- 
ture-independent representation of the work generated 
by a group of users in a particular type of computing 
environment. However, since the architectures of dif- 


ferent parallel I/O subsystems are so diverse, any 
observed workload will be tied to a particular machine. 
While we have tried to factor out these effects as much 
as possible, some care should be taken in generalizing 
our results. 

Data collection 

For our study, one trace file was collected for the entire 
file system. We traced only the I/O that involved the 
Concurrent File System (CFS); we did not record any 
I/O that was done through standard input and output 
or to the host file system (all limited to sequential, 
Ethernet speeds). We collected data for about 1 56 hours 
over a period of three weeks. While we did not trace 
continuously for the whole three weeks, we tried to get 
a realistic picture of the whole workload by tracing at 
different times of the day and of the week, including 
nights and weekends. The period covered by a single 
trace file ranges from 30 minutes to 22 hours. The 
longest continuously traced period was about 62.5 
hours. Tracing was usually initiated when the machine 
was idle. For those few cases in which a job was running 
when we began tracing, the job was not traced. Tracing 
was stopped in one of two ways: manually or by a system 
crash. The machine was usually idle when a trace was 
manually stopped. 

The trace files begin with a header record containing 
enough information to make the file self-descriptive, and 
continue with a series of event records (one per event). 
These events include individual read and write requests, 
as well as operations like file extensions and deletions. 
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Related work 


There has never been an extensive study 
of a production scientific workload on a 
multiprocessor file system. Related file- 
system workload studies can be classi- 
fied as characterizing general-purpose 
workstations (or workstation networks), 
scientific vector applications, or scien- 
tific parallel applications. 

General-purpose workstations. Uni- 
processor file access patterns have been 
measured many times. Ousterhout et 
al. measured isolated Unix worksta- 
tions,* and Baker etal. measured a dis- 
tributed Unix (Sprite) system.^ All of 
these studies cover general-purpose 
(engineering and office) workloads 
with uniprocessor applications. 

Scientific vector applications. Some 
studies specifically examined scientific 
workloads. Del Rosario and Choud- 
hary provide an informal characteriza- 
tion of grand-challenge applications.^ 
Miller and Katz traced specific I/O- 
intensive Cray applications to deter- 
mine the per-file access patterns, focus- 
ing primarily on access rates.^ Pasquale 
and Polyzos studied I/O-intensive Cray 
applications, focusing on patterns in 


the I/O rate.^ All of these studies are 
limited to uniprocess applications on 
vector supercomputers. 

Scientific parallel applications. Reddy 
et al. chose five sequential scientific 
applications from the Perfect bench- 
marks and parallelized them for an 
eight-processor Alliant, finding only 
sequential file-access patterns.^ This 
study is interesting, but far from what 
we need: the sample size is small; the 
programs are parallelized sequential 
programs, not parallel programs perse\ 
and the I/O itself was not parallelized. 
Cypher et al. studied individual paral- 
lel scientific applications, measuring 
temporal patterns in I/O rates.^ 
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Since one of the Charisma project’s goals is to organize 
and facilitate a muldplatform file-system tracing effort, 
we have defined a large set of event records suitable for 
both single- and multiple-instruction multiple-data sys- 
tems (SIMD and MIMD systems). 

On the iPSC/860, high-level CFS calls are imple- 
mented in a library that is linked with the user’s pro- 
gram. We instrumented the library calls to generate an 
event record each time they were called. The event 
records were buffered at each compute node and peri- 
odically sent to a data collector running on the service 
node. The collector then wrote the data to the central 
trace file (itself on CFS). The collector’s use of CFS was 
not recorded in the trace. 

Since our instrumentation was almost entirely within 
a user-level library, there were some jobs whose file 
accesses were not traced. These included most system 
programs (such as Is, cp, and ftp) as well as user programs 
that were not relinked during the period we were trac- 
ing. We did, however, record all job starts and ends 
through a separate mechanism. While we were tracing, 
3016 jobs were run on the compute nodes, of which 2237 
were only run on a single node. We actually traced at 
least 429 of the 779 multinode jobs and at least 41 of the 


single-node jobs. As a tremendous number of the single- 
node jobs were system programs, it is not surprising nor 
necessarily undesirable that so many were untraced. In 
particular, there was one single-node job that was run 
periodically and that accounted for over 800 of the sin- 
gle-node jobs, simply to check the status of the machine. 
There was no way to distinguish between a job that was 
untraced from a job that simply did no CFS I/O, so the 
numbers of traced jobs are a lower bound. 

One of our primary concerns was to minimize the 
degree that our measurement perturbed the workload. 
We identified three ways that it might do so. Our first 
concern was network contention. We expected users’ 
jobs to generate a great many event records. Had we 
sent a message to the data collector for each event 
record, we would have created unreasonable congestion 
near the collector, or perhaps in the overall machine. 
Since large messages on the iPSC are broken into 4- 
Kbyte blocks, we created a buffer of that size on each 
node to hold local event records. This buffer let us 
reduce the number of messages sent by more than 90% 
without stealing much memory from user jobs. 

The second concern was local CFS overhead. Since we 
were tracing every I/O operation in a production envi- 
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Table 1 . Summary statistics of the traces we 
collected, and of files opened.^ Only those jobs 
whose file accesses were caught by our library 
are included here. 


Jobs 


470 


Mbytes 

Read 

38,812 



Written 

44,725 


Files 

Opened 

63,779 

100% 


Read 

14,540 

22.8% 


Written 

44,500 

69.8% 


Both 

2259 

3.5% 


Neither 

2480 

3.9% 



Number of jobs 

Figure 1 . Amount of time the machine spent with the 
given number of jobs running. This data includes all 
jobs, even if their file access could not be traced. 

ronment, it was imperative that the per-call overhead be 
kept to a minimum to avoid inconveniencing the users. By 
buffering records on the compute nodes, we were able to 
avoid the cost of message passing on every call to CFS. 

Our final concern was that we might increase con- 
tention for the I/O subsystem. We tried to minimize this 
by creating a large buffer for the data collector and writ- 
ing the data to CFS in large sequential blocks. Although 
we collected about 700 Mbytes of data, our trace files 
accounted for less than 1 % of the total traffic. 

Simple benchmarking of the instrumented library 
revealed that the overhead our instrumentation added 
was virtually undetectable in many cases. The worst case 
we foimd was a 7% increase in execution time on one 
run of the NAS NHT - 1 Application-I/O Benchmark. ^ 
After the instrumented library was put into production 
use, anecdotal evidence suggests that there was no 
noticeable performance loss. 

Analysis 

The raw trace files required some simple postprocess- 
ing before they could be easily analyzed. TTiis postpro- 
cessing included data realignment, clock synchroniza- 
tion, and chronological sorting. 

Since each node buffered 4 Kbytes of data before 
sending it to the central data collector, the raw trace file 


contained only a partially ordered list of event records. 
Ordering the records was complicated by the lad of 
synchronized clocks on the iPSC/860. Each node main- 
tains its own clock; the clocks are synchronized at sys- 
tem startup but each drifts significandy and differendy 
after that. We partially compensated for the asynchrony 
by timestamping each block of records when it left die 
node and again when it was received at the data collec- 
tor. From the difference between the two we could 
approximately adjust the event order to compensate for 
each node’s clock drift relative to the collector’s clock. 
This technique allowed us to get a closer approximation 
of the event order. Nonetheless, it is still an approxi- 
mation, so much of our analysis is based on spat al, 
rather than temporal, information. 

Results 

We characterize the workload from the top down, begin- 
ning with the number of jobs in the machine and the 
number and use of files by all jobs. We then examine indi- 
vidual I/O requests by looking for sequentiality, regular- 
ity, and sharing in the access pattern. Finally, we evalu- 
ate the effect on caching through trace-driven simulation. 

Jobs 

Table 1 provides an overview of this workload’s char- 
acteristics. Figure 1 gives an initial look into the details 
behind Table 1 by showing the amount of time the 
machine spent running a given number of jobs. For 
more than a quarter of the traced period, the machine 
was idle (that is, running zero jobs). For about 35% of 
the time, it was running more than one job, sometimes 
as many as eight. Although not all jobs use the file sj's- 
tem, a file system clearly must provide high-performance 
access by many concurrent, presumably imrelated, jobs. 
While uniprocessor file systems are tuned for this situ- 
ation, most research into multiprocessor file systems has 
ignored this issue, focusing on optimizing single-job 
performance. 

Of course, some of the jobs in Figure 1 were small, 
single-node jobs, and some were large parallel jobs. Fig- 
ure 2 shows the distribution of compute nodes used by 
each job. Single-node jobs dominated the job popula- 
tion, although large parallel jobs dominated node usage. 
This dichotomy would be larger in new “self-hosting” 
parallel systems. The lesson here is that a successful file 
system must allow both small, sequential jobs and large, 
highly parallel jobs access to the same files imder a vari- 
ety of conditions and system loads. 
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Files 

In Table 1, files are classified by how they were actually 
used rather than by the mode in which they were 
opened. Note that many more files were written than 
were read (more than three times as many). It appears 
that programmers of traced applications often found it 
easier to open a separate output file for each compute 
node, rather than coordinating writes to a common out- 
put file — as evidenced by the substantially smaller aver- 
age number of bytes written per file (1.2 Mbytes) than 
average bytes read per file (3.3 Mbytes). Also, there were 
extremely few files that were read and written in the 
same open. This is common in Unix file systems^ and 
may be accentuated here by the difficulty in coordinat- 
ing concurrent reads and writes to the same file (the CFS 
file-access modes are of little help for read-write access). 
We suspect that most of the files that were not accessed 
at all were opened by applications that terminated pre- 
maturely. 

Table 2 shows that most jobs opened only a few files 
over the course of their execution, although a few 
opened many (one job opened 2217 files). Some of the 
jobs that opened a large number of files were opening 
one file per node. Although not all files were open con- 
currendy, file-system designers must optimize access to 
several files within the same job. 

We found that only 0.6 1 % of all opens were to “tem- 
porary” files (a file deleted by the same job that created 
it), and nearly all of those may have been from one appli- 
cation. The rarity of temporary files and of files that 
were both read and written in the same open indicates 
that few applications chose to use files as an extension of 
memory for an “out of core” solution. Many of the 
NASA Ames applications are computational fluid 
dynamics codes, for which out-of-core methods are in 
general too slow. 

Figure 3 shows that most of the files accessed were 
large (10 Kbytes to I Mbyte). (Because there were many 
small files and many distinct peaks across the range of 
sizes, there was no constant granularity that captured 
the detail we felt was important in a histogram. We 
chose to plot the file sizes on a logarithmic scale with 
pseudo-logarithmic bucket sizes: The bucket size 
between 10 and 100 bytes is 10 bytes; between 100 and 
1000 it is 100 bytes, and so on.) 

It is important to note that each of the largest jumps 
in the figure is primarily due to one or two applications; 
undue emphasis should not be placed on the specific 
numbers as opposed to the general tendency toward 
larger files. Although these files were larger than those 


0 

1 60 
j 40 
^ 20 

0 

1 2 4 8 16 32 64 128 

Number of compute nodes 

Figure 2. Distribution of the number of compute 
nodes used by jobs in our workload (even those whose 
file access could not be traced). The iPSC limits the 
choice to powers of 2. 



Table 2. Among traced jobs, the number of files 
opened by jobs was often small (1-4). 


No. OF RLES 

No. OF JOBS 

1 

71 

2 

15 

3 

24 

4 

120 

5+ 

240 



File size (bytes) 

Figure 3. Cumulative distribution function (CDF) of the 
number of files of each size at close. For a file size x, 
CDF(x) represents the fraction of ail files that had x or 
fewer bytes. 


in a general-purpose file system,"^ they were smaller than 
we would expect to see in a scientific supercomputing 
environment.^ We suspect that users limited their file 
sizes due to the small disk capacity (7.2 Gbytes) and lim- 
ited disk bandwidth (10 Mbytes per second peak). 

I/O REQUEST SIZES 

Figures 4 and 5 show that the vast majority of reads are 
small, but that most bytes are transferred through large 
reads. Indeed, 96.1% of all reads were for fewer than 
4000 bytes, but those reads transferred only 2.0% of all 
data read. Similarly, 89.4% of all writes were for fewer 
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1 



Figure 4. CDF of the number of reads by request size 
and of the amount of data transferred by request size. 



Write size (bytes) 

Figure 5. CDF of the number of writes by request size 
and of the amount of data transferred by request size. 


than 4000 bytes, but those writes transferred only 3 % of 
all data written. The number of small requests is sur- 
prising due to their poor performance in CFS.^ The 
jump at 4 Kbytes indicates that some users have opti- 
mized for the file-system block size, but it appears that 
most users prefer ease of programming over perfor- 
mance. 

The figures show spikes in the number of small 
requests as well as in the data transferred by f-Mbyte 
requests. While the spikes of small requests occurred 
throughout the tracing period, one trace alone (proba- 
bly one job alone) contributed the spike at 1 Mbyte. 
Although the specific position of the spikes is likely due 
to the effect of individual applications, we believe that 
the preponderance of small request sizes is the natural 
result of parallelization by distributing file data across 
many processors, and would be found in other work- 
loads using a similar file-system interface. 

Sequentiality 

A common characteristic of file workloads, particularly 
scientific workloads, is that files are accessed sequen- 
tially. We define a sequential request to be one that is at 
a higher file offset than the previous request from the 
same compute node, and a consecutive request to be a 
sequential request that begins where the previous 



Figure 6. CDF of sequential access to files on a per- 
node basis. 


request ended. Figures 6 and 7 show the amount of 
sequential and consecutive access to files with more than 
one request in our workload. 

The most notable features of these graphs are the 
spikes at 0% and 100%; most files were either entirely 
sequential (or consecutive) or not at all. Not surpris- 
ingly, access to read-write files was primarily nc»nse- 
quential. By far, most read-only and write-only files 
were 100% sequential. Most (86%) write-only files were 
100% consecutive, but that was largely due to the fact 
that most write-only files were written only by one 
processor. Only 29% of read-only files, however, were 
100% consecutive. The remainder (nonconsecutive, 
sequential read-only files) were the result of interleaved 
access, where successive records of the file are accessed 
by different nodes; from the perspective of an individ- 
ual node, some bytes must be skipped between one 
request and the next. 

I/O-REQUEST INTERVALS 

We define the number of bytes skipped to be the inter- 
val size. Consecutive accesses have interval size 0. The 
number of different interval sizes used in each file, across 
all nodes that access that file, is shown in Table 3. A sur- 
prising number of files were read or written in one 
request per node (that is, there were no intervals). <^er 
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Table 3. The number of different interval sizes 
used in each file across all participating nodes. 
Zero represents those cases where only one 
access was made to a file, per node. 


No. OF SIZES 

No. OF FILES 

Percent of total files 

0 

23,291 

36.5 

1 

37,148 

58.2 

2 

2561 

4.0 

3 

105 

0.2 

4+ 

674 

1.0 


Table 4. The number of different request sizes 
used in each file across all compute nodes. Files 
with zero different sizes were opened and 
closed without being accessed. 


No. of sizes 

No. of files 

Percent of total files 

0 

2480 

3.9 

1 

25,523 

40.0 

2 

32,779 

51.4 

3 

2510 

3.9 

4+ 

487 

0.8 


99% of the 1 -interval-size files were consecutive accesses 
(in other words, every interval had size 0). The remain- 
der of 1 -interval-size files, along with the 2 -interval-size 
files, represent 5% of all files, and indicate another form 
of highly regular access pattern. Only 1.2% of all files 
had 3 or more different interval sizes, and their regu- 
larity (if any) was more complex. 

To get a better feel for this regularity, we also counted 
the number of different request sizes used in each file, as 
shown in Table 4, More than 90% of the files were 
accessed with only one or two request sizes. Combin- 
ing the regularity of request sizes with the regularity of 
interval sizes, many applications clearly used regular, 
structured access patterns. 

Strided access 

A series of requests to a file is a svmple-stnded access pat- 
tern if each request is the same size, and if the offset of 
the file pointer is incremented by the same amount 
between each request. This would correspond, for exam- 
ple, to the series of I/O requests generated by a node with- 
in an application reading a column of data from a matrix 
stored in row-major order. A portion of the file accessed 
with a strided pattern is a strided segment. A nested-strided 
access pattern is recursively similar to a simple-strided 
access pattern, in that it is composed of strided segments 
separated by regular strides in the file. 

Higher-level analysis revealed that well over 90% of 
the accesses in the traced workload were part of either a 
simple- or a nested-strided access pattern caused by the 
distribution of data across multiple compute nodes.^ Of 
the files in the workload, 26% were accessed (at least in 
part) in a strided fashion, and nearly 1/3 of those were 
accessed in a nested-strided fashion. Of the remaining 
files, 99% either had too few accesses to exhibit any pat- 
tern, were only accessed by a single node, or were accessed 
in a consecutive pattern. Thus, less than 1 % of all files 
were accessed in an irregular, parallel access pattern. 

Synchronization 

Given the regular request sizes and interval sizes shown 
in Tables 3 and 4, Intel’s I/O modes would seem to be 
helpful. Our traces show, however, that over 99% of the 
files used mode 0; that is, less than 1 % used modes 1,2, 


or 3 . Tables 3 and 4 give a hint as to why: although there 
were few different request sizes and interval sizes, there 
were often more than one, something not easily sup- 
ported by the automatic file modes. It may also be that 
these modes were slower than mode 0, so that pro- 
grammers chose not to use them. 

Sharing 

A file is shared if more than one job or process opens it. 
If the opens overlap in time, the file is concurrently shared. 
It is write-shared if one of the opens involves writing the 
file. In uniprocessor and distributed-system workloads, 
concurrent sharing is known to be rare.*^ In a parallel 
file system, concurrent file sharing among processes 
within a job is presumably the norm, while concurrent 
file sharing between jobs is likely to be rare. Indeed, in 
our traces we saw a great deal of file sharing within jobs, 
and no concurrent file sharing between jobs. The inter- 
esting question is how the individual bytes and blocks of 
the files were shared. Figure 8 shows the percentage of 
files (that were concurrently opened by multiple nodes) 
with varying amounts of byte and block sharing. There 
was more sharing for read-only files than for write-only 
or read-write files, which is not surprising given the 
complexity of coordinating write sharing. Indeed, 70% 
of read-only files had 100% of their bytes shared, while 
90% of write-only files had no bytes shared. While half 
of all read-write files (not shown in Figure 8) were 100% 
byte-shared, 93% of them were 100% block-shared, 
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Figure 8. CDF of file sharing between nodes in read- 
only and write-only files at byte and block granularity. 
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Figure 9. Results of compute-node caching simulation. Hit rates differed 
from job to job, with three distinct clumps, indicating that the cache 
either helped or did not. One buffer was as good as many buffers. 
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which would stress a cache consistency protocol, if 
present. Overall, the amount of block sharing implies 
strong interprocess spatial locality, and suggests that 
caching might be successful. 

Caching 

Buffering and caching are common in traditional file 
systems, and — with the right policies — can be suc- 
cessful in multiprocessor file systems. One advantage of 
buffers is to combine several small requests (which were 
common in this workload) into a few larger requests that 
can be more efficiently served by disk hardware. Indeed, 
with redundant disk arrays common on today’s multi- 
processors (such as the Intel Paragon and the KSR-2), 
it is even more important to avoid small requests at the 
disk level. Fortunately, the small requests seen in Figures 
4 and 5, when coupled with small interval size, lead to 
spatial locality. Other potential benefits may come from 
temporal or interprocess locality in the access pattern. 

In a distributed-memory machine, it is possible to place 
a buffer cache at the compute nodes, at the I/O nodes, or 
both. We evaluated all three with trace-driven simulation. 

Compute-node caching 

The amount of block sharing in write-only and read- 
write files show that any attempt to maintain write 
buffers at the compute nodes would necessitate a cache 
consistency protocol, so we restricted our effort to read- 
only files. The results of a simple trace-driven simula- 
tion of a compute-node cache of 4-Kbyte (one block), 
read-only buffers with least recently used replacement 
are shown in Figure 9. We consider a hit to be any 
request that was fiiUy satisfied from the local buffer (that 
is, with no request sent to an I/O node). 

Caching success, as indicated by a high hit rate, was 
limited to a subset of the jobs: 40% of the jobs had a 


greater than 75% hit rate, but 30% of 
the jobs had a 0% hit rate. Further, for 
those jobs where a cache was benefi- 
cial, a single one-block buffer per 
compute node was usually sufficient. 
A single buffer could maintain a high 
hit rate in patterns with a s mall request 
size (which was common; see Figures 
4 and 5) and a short (perhaps zero) 
interval size. Clearly there was spatial 
locality in our workload, and not much 
temporal locality, or multiple buffers 
would have helped more. (Multiple 
buffers were useful in very few jobs, 
apparently those that were interspersing reads from 
more than one file. In those cases, a single buffer per file 
would have been appropriate.) In short, it appears that 
a one-block buffer per compute node, per file, may be 
useful for read-only files, but a careful perfoinnance 
analysis is still necessary. 

I/O-node caching 

Given the apparent interprocess locality, I/O-node 
cadiing should be successful. To find out, we ran a trace- 
driven simulation of I/O-node caches, with 4 -Kbyte 
buffers managed by either a least recently used or FIFO 
replacement policy. These I/O-node caches served all 
compute nodes, all files, and all jobs, according to cur best 
guess of the event ordering witffin our traces. We a*sumed 
the file was striped in a round-robin feshion at a one-block 
grariularity. No compute-node cache was used. 

Figure 10 shows the results of the simulation With 
least-recently-used (LRU) replacement, a small cache 
(4000 4-Kbyte buffers over all I/O nodes) was sufficient 
to reach a 90% hit rate. With FIFO replacement, nearly 
20,000 buffers were needed to obtain a 90% hit rate, 
b^use FIFO does not ^ve preference to blocks with 
high locality. It made little difference whether the 
buffers were focused on a few I/O nodes or spread over 
many (that is, the hit rates were similar; performance is 
another issue). The success of such a small cache, cou- 
pled with the apparent lack of intraprocess locality in 
many jobs (see Figure 9), reconfirms the presence of 
interprocess spatial locahty. 

As a final test, we simulated the combination of a sin- 
gle buffer per compute node and a cache at each of 10 
I/O nodes. The result was a only a 3% reduction in the 
I/G-node hit rate when each I/O node had a small cache 
of 50 buffers. This further suggests that most of tf e hits 
in the I/O-node cache were indeed a result of i nter- 
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Figure 10. Results of l/O-node caching simulation. Each line represents a 
complete run of the simulation with a fixed number of I/O nodes 
ranging from 1 to 20. 


process locality because, as Figure 9 
shows, the limited intraprocess locali- 
ty was filtered out by the compute- 
node cache. 

In constrast, Miller and Katz's trac- 
ing study^ found little benefit from 
caching, although it did show a bene- 
fit from prefetching and write-behind. 

Both their workload and ours involve 
sequential access patterns; the differ- 
ence is that the small requests in our 
access pattern lead to intraprocess spa- 
tial locality, and the distribution of a 
sequential pattern across parallel com- 
pute nodes leads to interprocess spatial locality, both of 
which could be successfully captured by caching. 

A lthough this workload had many charac- 
teristics in conunon with those in previ- 
ous studies of scientific applications and 
file systems (large file sizes, sequential 
access, little inter-job concurrent sharing), 
parallelism had a significant effect on some workload 
characteristics (smaller request sizes, and lots of intra- 
job concurrent file sharing) and added some new char- 
acteristics (nonconsecutive sequential access and inter- 
process spatial locality), A multiprocessor used for 
scientific applications will not be well served by a file 
system ported from a distributed system, which was 
tuned for a different set of workload characteristics. In 
particular, parallelism leads to new, interleaved access 
patterns with no temporal locality, and high interprocess 
spatial locality at the I/O node. 

Compute-node caches are probably best implemented 
as a single buffer per file (but only if carefully managed 
for consistency). I/O-node caches can effectively com- 
bine small requests from many compute nodes, avoid- 
ing extraneous disk I/O and raising the potential for 
large disk I/Os, a significant benefit when the I/O nodes 
serve redimdant disk arrays (which favor large transfers) 
rather than individual disks. Replacement policies other 
than least- recently-used or FIFO can optimize for 
sequential access and interprocess locahty, rather than 
traditional spatial and temporal locality.® 

Ultimately, we believe that the file-system interface 
must change. The current interface forces the pro- 
grammer to break down large parallel I/O activities into ^ 
small, noncontiguous requests. While compute-node 
and I/O-node caching can help, it would be better to 
support I/O requests from the programmer’s inter- 


face to the compute node, and from the compute node 
to the I/O node. Astrided request can express a regular 
request and interval size (which were common in our 
workload), effectively increasing the request size, low- 
ering overhead, and perhaps eliminating the need for 
compute-node buffers. The Cray file system is an exam- 
ple of a system that supports strided requests. 

Some of our results may be specific to workloads on 
Intel CFS file systems, or to NASA Ames' workload (com- 
putational fluid dynamics). Although the exact numbers 
are workload-specific, we believe that our conclusions are 
applicable to scientific workloads running on loosely cou- 
pled MIMD multiprocessors with a CFS-like interface — 
that is, an interfece that encourages interleaved access and 
an independent file pointer for each node. This category 
includes many current multiprocessors. 

We plan to continue collecting traces from other 
machines and environments to broaden and deepen the 
experimental data, and strengthen the generality of our 
conclusions. One such project has already produced 
results.^ We may also convert these and other results 
into a meaningful, synthetic benchmark of parallel I/O. 
We will use this new knowledge to design a better multi- 
processor file system, and use the traces in trace-driven 
simulations of new policies for caching, prefetching, 
load balancing, paging, and so forth. We also plan to 
evaluate new I/O architectures.^ 
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